A text representation language for contextual and distributional processing
Abstract
This thesis examines distributional and contextual aspects of linguistic processing in relation to traditional symbolic approaches. Distributional processing is more commonly associated with statistical methods, while an integrated representation of context spanning document and syntactic structure is lacking in current linguistic representations. This thesis addresses both issues through a novel symbolic text representation language. The text representation language encodes information from all levels of linguistic analysis in a semantically motivated form. Using object-oriented constructs in a recursive structure that can be derived from the syntactic parse, the language provides a common interface for symbolic and distributional processing. A key feature of the language is a recursive treatment of context at all levels of representation. The thesis gives a detailed account of the form and syntax of the language, as well as a treatment of several important constructions. Comparisons are made with other linguistic and semantic representations, and several of the distinguishing features are demonstrated through experiments. The treatment of context in the representation language is discussed at length. The recursive structure employed in the representation is explained and motivated by issues involving document structure. Applications of the contextual representation in symbolic processing are demonstrated through several experiments. Distributional processing is introduced using traditional statistical techniques to measure semantic similarity. Several extant similarity metrics are evaluated using a novel evaluation metric involving adjective antonyms. The results provide several insights into the nature of distributional processing, and this motivates a new approach based on characteristic adjectives. Characteristic adjectives are distributionally derived and semantically differentiated vectors associated with a node in a semantic taxonomy. 
They are of significantly lower dimension than their undifferentiated source vectors, while retaining a strong correlation to their position in the semantic space. Their properties and derivation are described in detail, and an experimental evaluation of their semantic content is presented. Finally, the distributional techniques used to derive characteristic adjectives are extended to encompass symbolic processing. Rules involving several types of symbolic patterns are distributionally derived from a source corpus and applied to the text representation language. Polysemy is addressed in the derivation by limiting distributional information to monosemous words. The derived rules show a significant improvement in disambiguating nouns in a test corpus.
Publication date: 2010